Abstract

We discuss necessary and sufficient conditions for an 
auto-encoder to define a conservative vector field, in 
which case it is associated with an energy function akin 
to the unnormalized log-probability of the data. We 
show that the conditions for conservativeness are more 
general than for encoder and decoder weights to be the 
same (“tied weights”), and that they also depend on the 
form of the hidden unit activation function, but that contractive
training criteria, such as denoising, will enforce
these conditions locally. Based on these observations, 
we show how we can use auto-encoders to extract the 
conservative component of a vector field. 


Introduction 

An auto-encoder is a feature learning model that learns to
reconstruct its inputs by going through one or more capacity-constrained
“bottleneck” layers. Since it defines a mapping
$r : \mathbb{R}^n \rightarrow \mathbb{R}^n$, an auto-encoder can also be viewed as
a dynamical system that is trained to have fixed points at
the data (Seung 1998). Recent renewed interest in the dynamical
systems perspective led to a variety of results that
help clarify the role of auto-encoders and their relationship
to probabilistic models. For example, (Vincent et al. 2008;
Swersky et al. 2011) showed that training an auto-encoder
to denoise corrupted inputs is closely related to performing
score matching (Hyvärinen 2005) in an undirected model.
Similarly, (Alain and Bengio 2014) showed that training
the model to denoise inputs, or to reconstruct them under
a suitable choice of regularization penalty, lets the auto-encoder
approximate the derivative of the empirical data
density. And (Kamyshanska 2013) showed that, regardless
of training criterion, any auto-encoder whose weights are
tied (decoder weights are identical to the encoder weights)
can be written as the derivative of a scalar “potential” or
energy function, which in turn can be viewed as an unnormalized
data log-probability. For sigmoid hidden units the potential
function is exactly identical to the free energy of an
RBM, which shows that there is a tight link between these two
types of model.

The same is not true for untied auto-encoders, for which
it has not been clear whether such an energy function exists.
It has also not been clear under which conditions an
energy function exists or does not exist, or even how to define
it in the case where decoder weights differ from encoder
weights. In this paper, we describe necessary and sufficient
conditions for the existence of an energy function, and
we show that suitable learning criteria will lead to an auto-encoder
that satisfies these conditions at least locally, near
the training data. We verify our results experimentally. We
also show how we can use an auto-encoder to extract the
conservative part of a vector field.

*Authors contributed equally.


Background 

We will focus on auto-encoders of the form

$$r(x) = R\,h(W^T x + b) + c \tag{1}$$

where $x \in \mathbb{R}^n$ is an observation, $R$ and $W$ are decoder and
encoder weights, respectively, $b$ and $c$ are biases, and $h(\cdot)$ is
an elementwise hidden activation function. An auto-encoder
can be identified with its vector field, $r(x) - x$, which is
the set of vectors pointing from observations to their reconstructions.
The vector field is called conservative if it can
be written as the gradient of a scalar function $F(x)$, called
potential or energy function:

$$r(x) - x = \nabla F(x) \tag{2}$$


The energy function corresponds to the unnormalized
log-probability of the data.

In this case, we can integrate the vector field to find the
energy function (Kamyshanska 2013). For an auto-encoder
with tied weights and real-valued observations it takes the
form

$$F(x) = \int h(u)\,du - \frac{1}{2}\|x - c\|_2^2 + \text{const} \tag{3}$$

where $u = W^T x + b$ is an auxiliary variable and $h(\cdot)$ can
be any elementwise activation function with known anti-derivative.
For example, the energy function of an auto-encoder
with sigmoid activation function is identical to the
(Gaussian) RBM free energy (Hinton 2010):

$$F_{\text{sig}}(x) = \sum_k \log\left(1 + \exp\left(W_{\cdot k}^T x + b_k\right)\right) - \frac{1}{2}\|x - c\|_2^2 + \text{const} \tag{4}$$
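As a sanity check on Equations 2 to 4, the following is a minimal numerical sketch, assuming a tied-weight sigmoid auto-encoder with small random parameters (the sizes `D` and `H`, the random seed, and all function names are illustrative choices, not taken from the paper). It compares a finite-difference gradient of the sigmoid energy with the vector field $r(x) - x$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def reconstruct(x, W, b, c):
    """Tied-weight auto-encoder r(x) = W h(W^T x + b) + c with sigmoid h."""
    return W @ sigmoid(W.T @ x + b) + c

def energy(x, W, b, c):
    """Sigmoid energy: sum_k softplus(W_k^T x + b_k) - 0.5 * ||x - c||^2 (constant dropped)."""
    u = W.T @ x + b
    return np.sum(np.logaddexp(0.0, u)) - 0.5 * np.sum((x - c) ** 2)

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
D, H = 5, 8                      # illustrative sizes, not from the paper
W = rng.normal(size=(D, H))
b = rng.normal(size=H)
c = rng.normal(size=D)
x = rng.normal(size=D)

grad_F = numerical_gradient(lambda z: energy(z, W, b, c), x)
vector_field = reconstruct(x, W, b, c) - x
max_abs_error = np.max(np.abs(grad_F - vector_field))
print("max |grad F - (r(x) - x)| =", max_abs_error)
```

The two quantities should agree up to finite-difference error, which is what Equation 2 requires of a tied auto-encoder.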

A sufficient condition for the existence of an energy function
is that the weights are tied (Kamyshanska 2013), but it
has not been clear if this is also necessary. A peculiar phenomenon
in practice is that it is very common for decoder
and encoder weights to be “similar” (albeit not necessarily
tied) in response to training. An example of this effect
is shown in Figure 1.¹ This raises the question of why this
happens, and whether the quasi-tying of weights has anything
to do with the emergence of an energy function, and if
yes, whether there is a way to compute the energy function
despite the lack of exact symmetry. We shall address these
questions in what follows.

Figure 1: Encoder weights $W$ (left) and decoder weights $R$ (right).

Conservative auto-encoders 

One of the central objectives of this paper is understanding 
the conditions for an auto-encoder to be conservative and 
thus to have a well-defined energy function. In the following 
subsection we derive and explain said conditions. 

Conditions for conservative auto-encoders 

Proposition 1. Consider an m-hidden-layer auto-encoder
defined as

$$r(x;\theta) = W^{(m)} h^{(m)}\left(W^{(m-1)} h^{(m-1)}\left(\cdots W^{(1)} h^{(1)}(x) \cdots\right) + c^{(m-1)}\right) + c^{(m)}$$

where $\theta = \bigcup_{k=1}^{m} \theta^{(k)}$ such that $\theta^{(k)} = \{W^{(k)}, c^{(k)}\}$ are the
parameters of the model, and $h^{(k)}(\cdot)$ is a smooth elementwise
activation function at layer $k$. Then the auto-encoder
is said to be conservative over a smooth simply connected
domain $K \subseteq \mathbb{R}^D$ if and only if its reconstruction’s Jacobian
$\frac{\partial r(x)}{\partial x}$ is symmetric for all $x \in K$.

¹We found these kinds of behaviours not only for unwhitened,
but also for binary data.

²The expressions “conservative vector field” and “conservative
auto-encoders” will be used interchangeably.

A formal proof is provided in the Appendix. 

A region $K$ is said to be simply connected if and only if
any simple closed curve in $K$ can be shrunk to a point. It is not
always the case that a region of $\mathbb{R}^D$ is simply connected. For
instance, a curve surrounding a punctured circle in $\mathbb{R}^2$ cannot
be continuously deformed to a point without crossing the
punctured region. However, as long as we make the reasonable
assumption that the activation function does not have a
continuum of discontinuities, we should not run into trouble.
This makes our analysis valid for activation functions with
cusps such as ReLUs.

Throughout the paper, our focus will be on one-hidden- 
layer auto-encoders. Although the necessary and sufficient 
conditions for their conservativeness are a special case of the 
above proposition, it is worthwhile to derive them explicitly. 

Proposition 2. Let $r(x)$ be a one-hidden-layer auto-encoder
with $D$-dimensional inputs and $H$ hidden units,

$$r(x) = R\,h(W^T x + b) + c,$$

where $R$, $W$, $b$, $c$ are the parameters of the model. Then $r(x)$
defines a conservative vector field over a smooth simply connected
domain $K \subseteq \mathbb{R}^D$ if and only if $R D_{h'} W^T$ is symmetric
for all $x \in K$, where $D_{h'} = \mathrm{diag}(h'(x))$ collects the
activation derivatives of the $H$ hidden units at $x$.

Proof. Following Proposition 1, an auto-encoder defines a
conservative vector field if and only if its Jacobian is symmetric
for all $x \in K$:

$$\frac{\partial r(x)}{\partial x} = \left(\frac{\partial r(x)}{\partial x}\right)^T \tag{5}$$

By explicitly calculating the Jacobian, this is equivalent to

$$\sum_{l=1}^{H} \left(R_{jl} W_{il} - R_{il} W_{jl}\right) h'_l(x) = 0 \quad \forall\, 1 \le i < j \le D \tag{6}$$

Defining $D_{h'} = \mathrm{diag}(h'(x))$, this holds if and only if

$$R D_{h'} W^T = W D_{h'} R^T \tag{7}$$

□

For one-hidden-layer auto-encoders with tied weights,
Equation 7 holds regardless of the choice of activation function
$h$ and $x$.

Corollary 1. An auto-encoder with tied weights always de¬ 
fines a conservative vector field. 
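The symmetry condition of Proposition 2 is easy to probe numerically. The sketch below assumes a sigmoid activation and random parameters (sizes, seed, and helper names such as `asymmetry` are illustrative assumptions, not part of the paper). It evaluates the Jacobian $R D_{h'} W^T$ at one point for a tied and an untied parametrization and reports how far each is from being symmetric.

```python
import numpy as np

def sigmoid_prime(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def jacobian(x, R, W, b):
    """Jacobian of r(x) = R h(W^T x + b) + c for sigmoid h: R diag(h'(u)) W^T."""
    u = W.T @ x + b
    return R @ np.diag(sigmoid_prime(u)) @ W.T

def asymmetry(J):
    """Largest absolute entry of J - J^T; zero if and only if J is symmetric."""
    return np.max(np.abs(J - J.T))

rng = np.random.default_rng(1)
D, H = 4, 6                      # illustrative sizes
W = rng.normal(size=(D, H))
R_untied = rng.normal(size=(D, H))
b = rng.normal(size=H)
x = rng.normal(size=D)

tied_asym = asymmetry(jacobian(x, W, W, b))
untied_asym = asymmetry(jacobian(x, R_untied, W, b))
print("tied:", tied_asym, "untied:", untied_asym)
```

With tied weights the asymmetry vanishes up to rounding, in line with Corollary 1, whereas a randomly drawn decoder generally violates the condition.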

Proposition 2 illustrates that the set of all one-layered tied
auto-encoders is actually a subset of the set of all conservative
one-layered auto-encoders. Moreover, the inclusion
is strict. That is to say, there are untied conservative auto-encoders
that are not trivially equivalent to tied ones. As an
example, let us compare the parametrization of tied and conservative
untied linear one-layered auto-encoders. By Eq. 7, the untied
linear auto-encoder $r_{\text{untied}}(x) = R W^T x$ defines a conservative
vector field if and only if $R W^T = W R^T$, which offers a richer
parametrization than the tied linear auto-encoder $r_{\text{tied}}(x) = W W^T x$.
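One concrete way to see that the inclusion is strict is sketched below. It assumes a linear auto-encoder and picks the decoder as $R = W S$ for an arbitrary symmetric $H \times H$ matrix $S$ (this particular construction, the sizes, and the variable names are illustrative assumptions). Then $R W^T = W S W^T$ is symmetric although $R \neq W$, and $F(x) = \frac{1}{2} x^T (R W^T - I) x$ serves as an energy function.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 3, 5                      # illustrative sizes
W = rng.normal(size=(D, H))
A = rng.normal(size=(H, H))
S = 0.5 * (A + A.T)              # arbitrary symmetric H x H matrix (illustrative choice)
R = W @ S                        # untied decoder: differs from W in general

M = R @ W.T                      # Jacobian of the linear auto-encoder r(x) = R W^T x

def potential(x):
    """Candidate energy F(x) = 0.5 * x^T (R W^T - I) x, valid when R W^T is symmetric."""
    return 0.5 * x @ (M - np.eye(D)) @ x

x = rng.normal(size=D)
eps = 1e-6
grad_F = np.array([
    (potential(x + eps * e) - potential(x - eps * e)) / (2 * eps)
    for e in np.eye(D)           # iterate over the standard basis vectors
])
field = M @ x - x                # vector field r(x) - x

jacobian_asymmetry = np.max(np.abs(M - M.T))
weights_differ = not np.allclose(R, W)
gradient_error = np.max(np.abs(grad_F - field))
print(jacobian_asymmetry, weights_differ, gradient_error)
```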

In the following section we explore in more detail and
generality the parametrization imposed by the conditions
above.

Understanding the symmetricity condition 

Note that if symmetry holds in the Jacobian of an auto-encoder’s
reconstruction function, then the vector field is
conservative. A sufficient condition for symmetry of the Jacobian
is that $R$ can be written

$$R = C W D_{h'} E, \tag{8}$$

where $C$ and $E$ are symmetric matrices, and $C$ commutes
with $W D_{h'} E D_{h'} W^T$, as this will ensure symmetry of the
partial derivatives:

$$\frac{\partial r(x)}{\partial x} = R D_{h'} W^T = C W D_{h'} E D_{h'} W^T = W D_{h'} E D_{h'} W^T C = W D_{h'} R^T = \left(\frac{\partial r(x)}{\partial x}\right)^T \tag{9}$$

The case of tied weights ($R = W$) follows if $C$ and $E$ are
the identity, since then $\frac{\partial r(x)}{\partial x} = R D_{h'} W^T = \left(\frac{\partial r(x)}{\partial x}\right)^T$.

Notice that $R = C W D_{h'}$ and $R = W D_{h'} E$ are further
special cases of the condition $R = C W D_{h'} E$ when $E$ is
the identity (first case) or $C$ is the identity (second case).
Moreover, we can also find matrices $E$ and $C$ given the parameters
$W$ and $R$, which is shown in Section 1.2 of the
supplementary material.³

³www.uoguelph.ca/~imj/files/conservative_ae_supplementary.pdf


Conservativeness of trained auto-encoders 

Following (Alain and Bengio 2014), we will first assume
that the true data distribution is known and the auto-encoder
is trained. We then analyze the conservativeness of auto-encoders
around fixed points of the data manifold. After that,
we will proceed to empirically investigate and explain the
tendency of trained auto-encoders to become conservative
away from the data manifold. Finally, we will use the obtained
results to explain why the product of the encoder and
decoder weights becomes increasingly symmetric in response
to training.


Local Conservativeness 

Let $r(x)$ be an auto-encoder that minimizes a contraction-regularized
squared loss function averaged over the true data
distribution $p$,

$$\mathcal{L} = \int p(x) \left[ \|r(x) - x\|_2^2 + \lambda \left\| \frac{\partial r(x)}{\partial x} \right\|_F^2 \right] dx \tag{10}$$

where $\lambda$ is the contraction parameter. A point $x \in \mathbb{R}^D$ is a
fixed point of the auto-encoder if and only if $r(x) = x$.

Proposition 3. Let $r(x)$ be an untied one-layer auto-encoder
minimizing Equation 10. Then $r(x)$ is locally conservative
as the contraction parameter tends to zero.

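Taking a first order Taylor expansion of $r(x)$ around a
fixed point $x$ yields

$$r(x + \epsilon) = x + \left(\frac{\partial r(x)}{\partial x}\right)^T \epsilon + o(\epsilon) \quad \text{as } \epsilon \to 0. \tag{11}$$

(Alain and Bengio 2014) shows that the reconstruction
$r(x) - x$ becomes an estimator of the score when $\|r(x) - x\|_2$
is small and the contraction parameter $\lambda \to 0$. Hence
around a fixed point we have

$$r(x + \epsilon) - x = \epsilon \frac{\partial \log p(x)}{\partial x}, \quad \text{and} \tag{12}$$

$$\frac{\partial (r(x + \epsilon) - x)}{\partial x} = \epsilon \frac{\partial^2 \log p(x)}{\partial x^2}. \tag{13}$$

By explicitly expressing the Jacobian of the auto-encoder’s
dynamics $\frac{\partial (r(x) - x)}{\partial x}$ and using the Taylor expansion
of $r(x)$, we have

$$W D_{h'} R^T - I = \epsilon \frac{\partial^2 \log p(x)}{\partial x^2} \tag{14}$$

where $I$ is the identity matrix.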


Since the Hessian of $\log p(x)$ is symmetric, Equation 14
illustrates that around fixed points, $R D_{h'} W^T$ is symmetric. In
conjunction with Proposition 2, this shows that untied auto-encoders,
when trained using a contractive regularizer, are
locally conservative. Note that when the auto-encoder is
trained with patterns drawn from a continuous family, the
auto-encoder forms a continuous attractor that lies near the
examples it is trained on (Seung 1998).

It is worth noting that dynamics around fixed points can
be understood by analyzing the eigenvalues of the Jacobian.
The latter being symmetric implies that its eigenvalues cannot
have complex parts, which corresponds to the lack of
oscillations one would naturally expect of a conservative
vector field. Moreover, in directions orthogonal to the fixed
point, the eigenvalues of the reconstruction will be negative.
Thus the fixed point is actually a sink.

Figure 2: The symmetricity of $\frac{\partial r(x)}{\partial x}$ and the symmetricity of $R W^T$ for sigmoid activation and ReLU activation
are illustrated over the learning time of the auto-encoder.

Table 1: Symmetricity of $R D_{h'} W^T$ after training AEs with
500 units on MNIST for 100 epochs. We denote the auto-encoders
with weight length constraints as ‘+wl’.

        ReLU     ReLU+wl   sig.     sig.+wl
AE      95.9%    98.7%     95.1%    99.1%
CAE     95.2%    98.6%     97.4%    99.1%

Empirical Conservativeness 

We now empirically analyze the conservativeness of trained
untied auto-encoders. To this end, we train an untied auto-encoder
with 500 hidden units with and without weight
length constraints⁴ on the MNIST dataset. We measure symmetricity
using $\mathrm{sym}(A) = \frac{\|(A + A^T)/2\|_2}{\|A\|_2}$, which yields values
in $[0, 1]$ with 1 representing complete symmetricity.
Figures 2a and 2c show the evolution of the symmetricity of
$\frac{\partial r(x)}{\partial x} = R D_{h'} W^T$ during training. For untied auto-encoders,
we observe that the Jacobian becomes increasingly symmetric
as training proceeds and hence, by Proposition 2, the
auto-encoder becomes increasingly conservative.
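A score of this kind can be sketched as follows, assuming it is the norm of the symmetric part of a matrix relative to the norm of the matrix itself, and assuming the Frobenius norm (both the norm choice and the function name `symmetricity` are assumptions made for illustration). Under these assumptions the score lies in $[0, 1]$, equals 1 exactly for symmetric matrices, and equals 0 for antisymmetric ones.

```python
import numpy as np

def symmetricity(A):
    """Norm of the symmetric part of A relative to the norm of A (Frobenius norm assumed)."""
    sym_part = 0.5 * (A + A.T)
    return np.linalg.norm(sym_part) / np.linalg.norm(A)

symmetric = np.array([[2.0, 1.0], [1.0, 3.0]])
antisymmetric = np.array([[0.0, 1.0], [-1.0, 0.0]])
mixed = symmetric + antisymmetric        # symmetric part plus a rotation component

score_symmetric = symmetricity(symmetric)
score_antisymmetric = symmetricity(antisymmetric)
score_mixed = symmetricity(mixed)
print(score_symmetric, score_antisymmetric, score_mixed)
```

In an experiment of the kind described here, `A` would be the Jacobian $R D_{h'} W^T$ evaluated at a data point, with the score averaged over points.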

The contractive auto-encoder tends more towards symmetry
than the unregularized auto-encoder. The unregularized and
contractive auto-encoders reach plateaus around 0.951 and
0.974, respectively. It is interesting to note
that auto-encoders with weight length constraints yield noticeably
higher symmetricity scores, as shown in Table 1. The
details of the experiments and further interpretations are provided
in the supplementary material.

To explicitly confirm the conservativeness of the auto-encoder
in 2D, we monitor the curl of the vector field during
training. In our experiments, we created three 2D synthetic
datasets by adding Gaussian white noise to the parametrization
of a line, a circle, and a spiral. As shown in Figure 3,
we notice that the curl decreases very sharply during training,
which further demonstrates how untied auto-encoders
become more conservative during training. Together, the
symmetricity measurements and the decay of the curl indicate
that vector fields near the data manifold have a tendency
to become conservative. More results on the line, circle,
and spiral synthetic datasets can be found in the supplementary
materials.
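Monitoring the curl in 2D can be sketched as below, assuming a one-hidden-layer sigmoid auto-encoder on two-dimensional inputs and a central finite-difference estimate of the scalar curl (the grid, step size, parameter sizes, and function names are illustrative assumptions; the paper does not specify how the curl was computed).

```python
import numpy as np

def curl_2d(field, x, y, eps=1e-5):
    """Scalar curl dFy/dx - dFx/dy of a 2D vector field via central differences."""
    dFy_dx = (field(x + eps, y)[1] - field(x - eps, y)[1]) / (2 * eps)
    dFx_dy = (field(x, y + eps)[0] - field(x, y - eps)[0]) / (2 * eps)
    return dFy_dx - dFx_dy

def make_field(R, W, b, c):
    """Vector field r(x) - x of a one-hidden-layer sigmoid auto-encoder on 2D inputs."""
    def field(x, y):
        v = np.array([x, y])
        h = 1.0 / (1.0 + np.exp(-(W.T @ v + b)))
        return R @ h + c - v
    return field

rng = np.random.default_rng(3)
H = 10                            # illustrative number of hidden units
W = rng.normal(size=(2, H))
R = rng.normal(size=(2, H))
b = rng.normal(size=H)
c = rng.normal(size=2)

grid = np.linspace(-1.0, 1.0, 5)

def mean_abs_curl(field):
    """Average absolute curl over a small evaluation grid."""
    return np.mean([abs(curl_2d(field, x, y)) for x in grid for y in grid])

curl_tied = mean_abs_curl(make_field(W, W, b, c))
curl_untied = mean_abs_curl(make_field(R, W, b, c))
print("tied:", curl_tied, "untied (random weights):", curl_untied)
```

Tracking `mean_abs_curl` over training epochs for an untied model would give a curve of the kind discussed in this paragraph; the tied case is included only as a reference whose curl is zero up to numerical error.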


Symmetricity of weights product 

The product of weights $R W^T$ tends to become increasingly
symmetric during training. This behavior is more marked for
sigmoid activations than for ReLUs, as shown in Figures 2b
and 2d. This can be explained by considering the Jacobian
symmetricity. We approximately have

$$\sum_{l=1}^{H} \left(R_{il} W_{jl} - R_{jl} W_{il}\right) h'_l(x) = 0, \quad \forall\, 1 \le i, j \le D \tag{15}$$

⁴Weight length constraints: $\|w_i\|_2 = \alpha$ for all $i = 1, \dots, H$,
where $\alpha$ is a constant term.


This implies that the activations of sigmoid hidden units,
at least for training data points, are independent of $h'(x)$ or
a constant.

Figure 3: Initial and final vector field after training an untied auto-encoder on the spiral dataset.

As shown in the supplementary material, most hidden unit
activities are concentrated in the highest curvature region
when training with weight length constraints. This forces
$h_l(x)$ to be concentrated on high curvature regions of the
sigmoid activation. This may be due to either $h'_l(x)$ being
nearly constant for all $l$ given $x$, or $h'_l(x)$ being close to
linearly independent. In both cases, the Jacobian becomes close
to the identity and hence $R W^T \approx W R^T$.


Decomposing the Vector Field 

In this section, we consider finding the closest conservative
vector field, in a least squares sense, to a non-conservative
vector field. Finding this vector field is of great practical
importance in many areas of science and engineering (Bhatia et
al. 2013). Here we show that conservative auto-encoders can
provide a powerful, deep learning based perspective onto
this problem.

The fundamental theorem of vector calculus, also known
as the Helmholtz decomposition, states that any vector field in
$\mathbb{R}^3$ can be expressed as the orthogonal sum of an irrotational
and a solenoidal field. The Hodge decomposition is a generalization
of this result to high dimensional space (James
1966). A complete statement of the result requires careful
analysis of boundary conditions as well as differential form
formalism. But since 1-forms correspond to vector fields, and
our interest lies in the latter, we abuse notation to state the
result in the special case of 1-forms as

$$\omega = d\alpha + \delta\beta + \gamma \tag{16}$$

where $d$ is the exterior derivative, $\delta$ the co-differential, and
$\Delta\gamma = 0$.⁵ This means that any 1-form (vector field) can
be orthogonally decomposed into a direct sum of scalar,
solenoidal, and harmonic components.

This shows that it is always theoretically possible to get
the closest conservative vector field, in a least squares sense,
to a non-conservative one. When applied to auto-encoders,
this guarantees the existence of a best approximate energy
function for any untied auto-encoder. For a
more detailed background on the vector field decomposition
we refer to the supplementary material.

⁵For the Laplace-de Rham operator, $\Delta = d\delta + \delta d$. The standard $\Delta$ on 1-forms
is $d\delta$.


Extracting the Conservative Vector Field through Learning


Although the explicit computation of the projection might
be theoretically possible in special cases, we propose to find
the best approximate conservative vector field through learning.
There are several advantages to learning the conservative
part of a vector field: i) learning the scalar vector field component
$\alpha$ from some vector field $\omega$ with an auto-encoder is
straightforward due to the intrinsic tendency of the trained
auto-encoder to become conservative; ii) although there is a
large body of literature on explicitly computing the projections,
these methods are highly sensitive to boundary conditions
(Bhatia et al. 2013), while learning based methods eschew
this difficulty.

The advantage of deep learning based methods over existing
approaches, such as matrix-valued radial basis function
kernels (Macedo and Castro 2008), is that they can be
trained on very large amounts of data. To the best of our
knowledge, this is the first application of neural networks to
extract the conservative part of any vector field, effectively
recovering the scalar part of Eq. 16.


Two Dimensional space As a proof of concept, we first
extract the conservative part of a two dimensional vector
field $F(x, y) = (-x + y, -x - y)$. The field corresponds to
a spiralling sink. We train an untied auto-encoder with 1000
ReLU units for 500 epochs using BFGS over an equally
spaced grid of 100 points in each dimension. Figure 4 clearly
shows that the conservative part is perfectly recovered.
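For this particular linear field the conservative part can also be written down by hand, which gives a reference to compare a learned field against. The sketch below is not the learning procedure of the paper; it assumes the field is written as $F(p) = A p$ and splits $A$ into its symmetric and antisymmetric parts (the grid, the random perturbation test, and all names are illustrative assumptions). On a grid centred at the origin, the symmetric part gives the least squares closest conservative linear field, here $(-x, -y)$ with potential $-(x^2 + y^2)/2$, while the antisymmetric part $(y, -x)$ is the rotation.

```python
import numpy as np

A = np.array([[-1.0, 1.0],
              [-1.0, -1.0]])     # F(x, y) = A @ (x, y) = (-x + y, -x - y)

S = 0.5 * (A + A.T)              # symmetric part: Jacobian of the conservative component
K = 0.5 * (A - A.T)              # antisymmetric part: pure rotation

def potential(p):
    """Energy whose gradient is S @ p; here -(x^2 + y^2) / 2."""
    return 0.5 * p @ S @ p

# Least squares check on a centred grid: compare S against other symmetric matrices.
axis = np.linspace(-1.0, 1.0, 21)
points = np.array([[x, y] for x in axis for y in axis])

def fit_error(B):
    """Mean squared distance between the fields A p and B p over the grid."""
    residual = points @ (A - B).T
    return np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(4)
error_S = fit_error(S)
errors_perturbed = []
for _ in range(100):
    P = rng.normal(scale=0.1, size=(2, 2))
    errors_perturbed.append(fit_error(S + 0.5 * (P + P.T)))  # symmetric perturbation of S

print("S =", S.tolist(), "K =", K.tolist(), "error at S =", error_S)
```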


High Dimensional space We also conducted experiments
with high dimensional vector fields. We created a continuum
of vector fields by considering convex combinations of a
conservative and a non-conservative field. The former is obtained
by training a tied auto-encoder on MNIST and the
latter by setting the parameters of an auto-encoder to random
values. That is, we have $(W_i, R_i) = \beta (W_0, R_0) + (1 - \beta)(W_K, R_K)$,
where $(W_0, R_0)$ is the non-conservative auto-encoder
and $(W_K, R_K)$ is the conservative auto-encoder.
We repeatedly train a tied auto-encoder on this continuum
in order to learn its conservative part. The pseudocode for
the experiment is presented in Algorithm 1. Figure 5 shows
the mean squared error as a function of training epoch for
different values of $\beta$. We observe that the auto-encoder’s
loss function decreases as $\beta$ gets closer to 1. This is due
to the auto-encoder only being able to learn the conservative
component of the vector field.

We then compare the unnormalized model evidence of the
auto-encoders. The comparison is based on computing the
potential energy of auto-encoders given two points at a time.
These two points are from MNIST and a corrupted version of
the latter using salt and pepper noise. We validate our
experiments by counting the number of times where
$E(x) > E(x_{\text{rand}})$. Given that
the weights $(W_K, R_K)$ of the conservative auto-encoder are
obtained by training it on MNIST, the potential energy at
MNIST data points should be higher than that at the corrupted
MNIST data points. However, this does not hold for
$\beta < 1$. Even for $\beta = 0.6$, we can recover the conservative
component of the vector field up to 93%. Thus, we conclude
that the tied auto-encoder is able to learn the conservative
component of the vector field. The procedure is detailed in
Algorithm 1.

Figure 4: Vector field learning by tied (Middle) and untied (Right) auto-encoder on 2D unconservative vector field (Left).

Table 2: The fraction of observations with $E(x) > E(x_{\text{rand}})$ for different $\beta$ values.

β                0.0      0.2      0.4      0.6       0.8      1.0
CVF=Tied AE      0.5036   0.7357   0.9338   0.98838   0.9960   0.9968
CVF=Untied AE    0.5072   0.7496   0.9373   0.98595   0.9958   0.9968

Algorithm 1 Learning to approximate a conservative field with an auto-encoder

1: procedure ($\mathcal{D}$ be a data set)
2:   Let $(W_0, R_0)$ be random weights for the AE.
3:   Let $(W_K, R_K)$ be the AE trained on $\mathcal{D}$.
4:   Generate $F_i$ $\forall i = 1, \dots, N$ as follows:
       • $(W_i, R_i) = \beta (W_0, R_0) + (1 - \beta)(W_K, R_K)$
       • Sample $x_j$ from a uniform distribution in the data space.
       • $F_i = \{(x_j, r_i(x_j)) \text{ for } j = 1, \dots, N\}$
5:   for each vector field $F_i$ do
6:     Train a tied auto-encoder on $F_i$
7:     Compute $E(x)$ where $x \in \mathcal{D}$
8:     Compute $E(x_{\text{rand}})$ where $x_{\text{rand}} \sim$ Binomial
9:     Count the number of $E(x) > E(x_{\text{rand}})$.
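The data-generation and counting steps of Algorithm 1 can be sketched as follows. This is a partial illustration under several assumptions: sigmoid hidden units, biases `b` and `c` shared across the interpolated models, inputs sampled uniformly from $[0, 1]^D$, and toy random weights standing in for the trained MNIST model. The training step itself (step 6) is omitted, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def reconstruct(X, W, R, b, c):
    """Batch reconstruction r(x) = R h(W^T x + b) + c, one observation per row of X."""
    return sigmoid(X @ W + b) @ R.T + c

def interpolate(beta, params_0, params_K):
    """(W_i, R_i) = beta * (W_0, R_0) + (1 - beta) * (W_K, R_K)."""
    (W0, R0), (WK, RK) = params_0, params_K
    return beta * W0 + (1 - beta) * WK, beta * R0 + (1 - beta) * RK

def sample_field(beta, params_0, params_K, b, c, n_points, rng):
    """Pairs (x_j, r_i(x_j)) with x_j drawn uniformly from [0, 1]^D (assumed data space)."""
    W, R = interpolate(beta, params_0, params_K)
    X = rng.uniform(0.0, 1.0, size=(n_points, W.shape[0]))
    return X, reconstruct(X, W, R, b, c)

def tied_energy(X, W, b, c):
    """Energy of a tied sigmoid auto-encoder (form of Eq. 4), constant dropped."""
    return np.logaddexp(0.0, X @ W + b).sum(axis=1) - 0.5 * ((X - c) ** 2).sum(axis=1)

def fraction_favoring_data(energy_fn, X_data, X_rand):
    """Fraction of pairs with E(x) > E(x_rand)."""
    return float(np.mean(energy_fn(X_data) > energy_fn(X_rand)))

rng = np.random.default_rng(5)
D, H = 6, 4                                   # toy sizes, not the MNIST setting
W_K = rng.normal(size=(D, H))
params_K = (W_K, W_K)                         # stand-in for a trained tied AE (not actually trained)
params_0 = (rng.normal(size=(D, H)), rng.normal(size=(D, H)))
b, c = np.zeros(H), np.full(D, 0.5)

# One vector field of the continuum: inputs X and regression targets r_i(X).
X, targets = sample_field(0.6, params_0, params_K, b, c, n_points=100, rng=rng)
print(X.shape, targets.shape)
```

A tied auto-encoder would then be fitted to regress `targets` from `X`, and `fraction_favoring_data` would be evaluated with its energy on data points and their corrupted counterparts.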



Figure 5: Learning curves for tied (dashed) and untied
(solid) auto-encoders.

Table 2 shows that, on average, the auto-encoder’s potential
energy increasingly favors the original MNIST points
over the corrupted ones as the vector field $F_i$ moves from 0
to $K$. “CVF=Tied AE” refers to the conservative vector field
$F_K$ trained by a tied auto-encoder and “CVF=Untied AE”
refers to the conservative vector field $F_K$ trained by an untied
auto-encoder.

Discussion 

In this paper we derived necessary and sufficient conditions 
for autoencoders to be conservative, and we studied why 
the Jacobian of the autoencoder tends to become symmetric 
during training. Moreover, we introduced a way to extract 
the conservative component of a vector field based on these 
properties of auto-encoders. 

An interesting direction for future research is the use of 
annealed importance sampling or similar sampling-based 
approaches to globally normalize the energy function val¬ 
ues obtained from untied autoencoders. Another interesting 
direction is the use of parameterizations during training that 
will automatically satisfy the sufficient conditions for con¬ 
servativeness but are less restrictive than weight tying. 
Appendix

Conservative auto-encoders 

This section provides detailed derivations of Proposition 1 
in Section 3. 

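Proposition 1. Consider an m-hidden-layer auto-encoder
defined as

$$r(x;\theta) = W^{(m)} h^{(m)}\left(W^{(m-1)} h^{(m-1)}\left(\cdots W^{(1)} h^{(1)}(x) \cdots\right) + c^{(m-1)}\right) + c^{(m)}$$

where $\theta = \bigcup_{k=1}^{m} \theta^{(k)}$ such that $\theta^{(k)} = \{W^{(k)}, c^{(k)}\}$ are the
parameters of the model, and $h^{(k)}(\cdot)$ is a smooth elementwise
activation function at layer $k$. Then the auto-encoder
is said to be conservative over a smooth simply connected
domain $K \subseteq \mathbb{R}^D$ if and only if its reconstruction’s Jacobian
$\frac{\partial r(x)}{\partial x}$ is symmetric for all $x \in K$.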


The high level idea is that simply finding the anti-derivative
of an auto-encoder vector field as proposed
in (Kamyshanska 2013) does not work for untied auto-encoders.
This is due to the difference between solving first order
ordinary differential equations for tied auto-encoders
and first order partial differential equations for untied auto-encoders.
Therefore, here we present a different approach
that uses differential forms to facilitate the derivation of the
existence condition of a potential energy function in the case
of untied auto-encoders.


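The advantage of differential forms is that they allow us
to work with a generalized, coordinate free system. A differential
form $\alpha$ of degree 1 (1-form) on a smooth domain
$K \subseteq \mathbb{R}^D$ is an expression

$$\alpha = \sum_{i=1}^{D} f_i\, dx_i. \tag{17}$$

Using differential form algebra and exterior derivatives, we
can show when the 1-form implied by an untied auto-encoder
is exact, which means that $\alpha$ can be expressed as $\alpha = d\beta$
for some $\beta \in \Lambda^{0}(K)$. Let $\alpha$ be the 1-form implied by the
vector field of an untied auto-encoder. Then, we have

$$\alpha = \sum_{i=1}^{D} r_i\, dx_i, \quad \text{and} \quad d\alpha = \sum_{i=1}^{D} d(r_i \wedge dx_i) \tag{18}$$

where $\wedge$ is the exterior multiplication, $d$ is the differential
operator on differential forms, and $r(\cdot)$ is the reconstruction
function of the auto-encoder. Based on the exterior derivative
properties, i) if $f \in \Lambda^0(K)$ then $df = \sum_{i=1}^{D} \frac{\partial f}{\partial x_i} dx_i$,
and ii) if $\alpha \in \Lambda^l(K)$ and $\beta \in \Lambda^m(K)$ (Edelen 2011) then
$\alpha \wedge \beta = (-1)^{lm} \beta \wedge \alpha$,

\begin{align}
d\alpha &= \sum_{i=1}^{D} d(r_i \wedge dx_i) \notag \\
&= \sum_{i,j=1}^{D} \frac{\partial r_i}{\partial x_j}\, dx_j \wedge dx_i \tag{19} \\
&= -\sum_{1 \le i < j \le D} \frac{\partial r_i}{\partial x_j}\, dx_i \wedge dx_j \tag{20} \\
&\quad + \sum_{1 \le i < j \le D} \frac{\partial r_j}{\partial x_i}\, dx_i \wedge dx_j \tag{21} \\
&= \sum_{1 \le i < j \le D} \left( \frac{\partial r_j}{\partial x_i} - \frac{\partial r_i}{\partial x_j} \right) dx_i \wedge dx_j \tag{22}
\end{align}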


According to Poincaré’s theorem, every exact form is closed and,
conversely, if $\alpha \in \Lambda^1(K)$ is closed then it is
exact in a simply connected region, where
$\alpha$ is closed if $d\alpha = 0$. Then, by Poincaré’s theorem, we see
that

$$\frac{\partial r_j}{\partial x_i} - \frac{\partial r_i}{\partial x_j} = 0 \quad \forall\, 1 \le i < j \le D \tag{23}$$

This is equivalent to requiring the Jacobian to be symmetric
for all $x \in K$.
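Exactness of the 1-form has a direct numerical counterpart: line integrals of the vector field between two points do not depend on the path. The sketch below assumes a sigmoid one-hidden-layer auto-encoder with random weights and a midpoint-rule integrator along polylines (the waypoints, step count, sizes, and function names are illustrative assumptions). It compares a direct path with a detour for a tied and an untied model.

```python
import numpy as np

def make_field(R, W, b, c):
    """Vector field r(x) - x of a one-hidden-layer sigmoid auto-encoder."""
    def field(x):
        return R @ (1.0 / (1.0 + np.exp(-(W.T @ x + b)))) + c - x
    return field

def line_integral(field, waypoints, steps=2000):
    """Midpoint-rule integral of field . dx along the polyline through waypoints."""
    total = 0.0
    for start, end in zip(waypoints[:-1], waypoints[1:]):
        delta = (end - start) / steps
        for k in range(steps):
            total += field(start + (k + 0.5) * delta) @ delta
    return total

rng = np.random.default_rng(6)
D, H = 3, 6                       # illustrative sizes
W = rng.normal(size=(D, H))
R = rng.normal(size=(D, H))
b = rng.normal(size=H)
c = rng.normal(size=D)

start = np.zeros(D)
end = np.ones(D)
detour = np.array([1.0, -1.0, 0.5])
direct_path = [start, end]
detour_path = [start, detour, end]

def path_gap(field):
    """Absolute difference between the integrals along the two paths."""
    return abs(line_integral(field, direct_path) - line_integral(field, detour_path))

tied_gap = path_gap(make_field(W, W, b, c))
untied_gap = path_gap(make_field(R, W, b, c))
print("tied:", tied_gap, "untied:", untied_gap)
```

For the tied model the gap is zero up to discretization error, so the integral defines an energy difference, whereas for a generic untied model the two paths disagree.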